
PERF: Avoid Series constructor in DataFrame(dict(...), columns=) #57205

Merged
mroeschke merged 16 commits into pandas-dev:main from ref/dict_to_mgr
Feb 24, 2024

@mroeschke (Member) commented Feb 2, 2024

import pandas as pd, numpy as np
a = pd.Series(np.array([1, 2, 3]))
columns = range(10_000)
data = {i: a for i in columns}
%timeit pd.DataFrame(data, columns=columns)

51.3 ms ± 96 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) # PR
102 ms ± 1.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) # main

In [1]: import pandas as pd

In [2]: %timeit pd.DataFrame([], columns=['a'])
37.3 µs ± 231 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)  # PR
183 µs ± 584 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)  # main
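The `%timeit` lines above need IPython. As a rough sketch of how the same two constructor calls could be timed from a plain script, the snippet below uses the standard-library `timeit` module. The helper names (`build_from_dict`, `build_empty`, `best_time`) are hypothetical, the column count is reduced from 10,000 to 1,000 so it finishes quickly, and it reports the best run rather than the mean, so the numbers will not match those quoted above.

```python
import timeit

import numpy as np
import pandas as pd

# Same kind of input as the benchmark above, with fewer columns (assumption: 1,000).
a = pd.Series(np.array([1, 2, 3]))
columns = range(1_000)
data = {i: a for i in columns}


def build_from_dict():
    # Dict of Series plus an explicit ``columns=`` argument.
    return pd.DataFrame(data, columns=columns)


def build_empty():
    # Empty data with an explicit column label.
    return pd.DataFrame([], columns=["a"])


def best_time(func, number=5, repeat=3):
    # Best per-call time in seconds over ``repeat`` rounds of ``number`` calls each.
    return min(timeit.repeat(func, number=number, repeat=repeat)) / number


dict_seconds = best_time(build_from_dict)
empty_seconds = best_time(build_empty)
print(f"dict of Series: {dict_seconds * 1e3:.2f} ms per call")
print(f"empty frame:    {empty_seconds * 1e6:.1f} µs per call")
```

Running the script once on each branch (the PR and main) would give a comparable pair of numbers for each case.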

@mroeschke added the Performance and DataFrame labels Feb 2, 2024
@mroeschke added this to the 3.0 milestone Feb 2, 2024
@mroeschke
Member Author

When you have a chance, could you take a look, @jbrockmendel?

@jbrockmendel
Member

Will take a look. I like the idea of getting Series out of here.

    continue
array = data_values[idx]
arrays[i] = array
if is_scalar(array) and isna(array):
Member

Is this for the case where a user explicitly passes e.g. {key: pd.NaT}? Do we have tests for this?

Member Author

Correct. It doesn't look like we have an explicit test for it, so I'll add one.
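To make the case in this thread concrete, here is a minimal sketch using only public pandas APIs. The helper `is_missing_scalar` is hypothetical and simply mirrors the `is_scalar(array) and isna(array)` condition from the reviewed line; it is not the internal implementation. The resulting column dtype is deliberately not checked, since it may differ between pandas versions.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_scalar


def is_missing_scalar(value):
    # Hypothetical helper mirroring the reviewed condition:
    # True only for scalar missing values such as NaN, None, or NaT.
    # ``is_scalar`` is checked first, so array-likes never reach ``isna``.
    return bool(is_scalar(value) and pd.isna(value))


print(is_missing_scalar(pd.NaT))              # scalar and missing
print(is_missing_scalar(np.nan))              # scalar and missing
print(is_missing_scalar(1))                   # scalar, not missing
print(is_missing_scalar(pd.Series([np.nan]))) # not a scalar

# The user-facing case from the review: a dict whose value is a bare pd.NaT,
# combined with explicit ``columns=``. An index is required because the dict
# contains only scalar values.
df = pd.DataFrame({"a": pd.NaT}, columns=["a"], index=[0, 1])
print(df)
```

The frame has one column and two rows, and every entry is missing.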

@jbrockmendel
Member

LGTM

@mroeschke mroeschke merged commit 9a6c8f0 into pandas-dev:main Feb 24, 2024
@mroeschke mroeschke deleted the ref/dict_to_mgr branch February 24, 2024 00:13
pmhatre1 pushed a commit to pmhatre1/pandas-pmhatre1 that referenced this pull request May 7, 2024
…das-dev#57205)

* Avoid Series constructor inference in dict_to_mgr

* test_constructors passes

* Use construct_1d_arraylike_from_scalar

* PERF: Avoid Series constructor in DataFrame(dict(...), columns=)

* Fix whitespace and comment

* typing

* Just ignore

* add bug fix and test

* don't overwrite dtype
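
For readers less familiar with the constructor path these commits touch, the following sketch shows the user-visible behavior of `DataFrame(dict, columns=...)`: dict keys not listed in `columns` are dropped, and listed columns that are absent from the dict are filled with missing values. The data is made up for illustration, and the dtype of the filled column is not asserted because it may vary by pandas version.

```python
import pandas as pd

# Illustrative data (not from the PR): two Series keyed by column name.
data = {
    "a": pd.Series([1, 2, 3]),
    "b": pd.Series([4, 5, 6]),
}

# ``columns=`` selects and orders the result: "a" is dropped because it is
# not requested, and "c" is created as an all-missing column.
df = pd.DataFrame(data, columns=["b", "c"])

print(df)
print(list(df.columns))
```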

Successfully merging this pull request may close these issues:

* Empty dataframe creation
* DataFrame constructor with dict of series misbehaving when columns specified